
Big Data Engineer Resume


Westlake, TX

SUMMARY

  • Dynamic and motivated IT professional with around 8 years of experience as a Big Data Engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, Cloud, Data Warehouse/Data Mart, Data Visualization, Data Engineering, Reporting, and Data Quality solutions.
  • In-depth knowledge of Hadoop architecture and its components like YARN, HDFS, NameNode, DataNode, Job Tracker, Application Master, Resource Manager, Task Tracker, and the MapReduce programming paradigm.
  • Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, Zookeeper, and YARN.
  • Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations). Strong knowledge of distributed systems architecture and parallel processing, with an in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Expert in designing ETL data flows, creating mappings/workflows to extract data from SQL Server, and performing data migration and transformation from Oracle/Access/Excel sheets using SQL Server SSIS.
  • Expert in designing parallel jobs using various stages like Join, Merge, Lookup, Remove Duplicates, Filter, Dataset, Lookup File Set, Complex Flat File, Modify, Aggregator, and XML.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
  • Experienced with Spark, improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, MLlib, and pair RDDs; worked explicitly with PySpark and Scala.
  • Handled ingestion of data from different data sources into HDFS using Sqoop and Flume, performed transformations using Hive and MapReduce, and loaded the data into HDFS. Managed Sqoop jobs with incremental load to populate Hive external tables. Experience in importing streaming data into HDFS using Flume sources and Flume sinks, and transforming the data using Flume interceptors.
  • Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables.
  • Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow and Oozie. Experienced with the most common Airflow operators: PythonOperator, BashOperator, GoogleCloudStorageDownloadOperator, GoogleCloudStorageObjectSensor, and GoogleCloudStorageToS3Operator (a minimal DAG sketch follows this list).
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, Cassandra, SQL Server, and PostgreSQL. Created Java apps to handle data in MongoDB and HBase. Used Apache Phoenix to create a SQL layer on HBase.
  • Experience in designing and creating RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics and an understanding of how to integrate them with other Azure services.
  • Experience with GCP compute services like App Engine and Cloud Functions.
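A minimal Airflow DAG sketch of the scheduling pattern described in the Airflow bullet above, assuming Airflow 1.10-style operator imports and a worker with gsutil available; the DAG id, GCS path, and task logic are hypothetical placeholders rather than code from any engagement listed below.

    # Hypothetical DAG: pull a daily file from GCS with gsutil, then validate it in Python.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    def validate_download():
        # Placeholder check; real logic would inspect the downloaded file.
        print("download finished")

    with DAG(
        dag_id="gcs_ingest_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        download = BashOperator(
            task_id="download_from_gcs",
            bash_command="gsutil cp gs://example-bucket/daily/input.csv /tmp/input.csv",
        )
        validate = PythonOperator(
            task_id="validate_download",
            python_callable=validate_download,
        )
        download >> validate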

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, GCP, HBase, Kafka, Oozie, Apache Spark, Zookeeper, NiFi, Amazon Web Services, Customer 360.

Machine Learning: Decision Tree, LDA, Linear and Logistic Regression, Random Forest, K-NN, K-Means Clustering, Neural Networks (ANN & RNN), PCA, SVM, NLP, Deep Learning.

Python Libraries: NLP, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup, PySpark

Operating System: Linux (Centos, Ubuntu), Windows (XP/7/8/10)

Languages: Java, Shell scripting, Pig Latin, Scala, Python, R, C++

Databases: SQL Server, MySQL, Teradata, DB2, Oracle, Databricks

NoSQL: HBase, Cassandra and MongoDB

Hadoop Technologies and Distributions: Apache Hadoop, Cloudera CDH 5.13, MapR, PySpark

Visualization/Reporting: Power BI, Tableau, ggplot2, matplotlib

Versioning Tools: SVN, Git, GitHub

PROFESSIONAL EXPERIENCE

Confidential, Westlake, TX

Big Data Engineer

Responsibilities:

  • Designed and set up an Enterprise Data Lake to support various use cases including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
  • Responsible for maintaining quality reference data in the source by performing operations such as cleaning and transformation and ensuring integrity in a relational environment, working closely with the stakeholders and solution architect.
  • Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB.
  • Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.
  • Performed end-to-end architecture and implementation assessment of various AWS services like Amazon EMR, Redshift, and S3.
  • Implemented machine learning algorithms in Python to predict the quantity a user might want to order for a specific item, so it can be suggested automatically, using Kinesis Firehose and an S3 data lake.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Used Spark SQL with the Scala and Python interfaces that automatically convert RDD case classes to schema RDDs.
  • Imported data from different sources like HDFS/HBase into Spark RDDs and performed computations using PySpark to generate the output response.
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost for EC2 resources (a minimal sketch follows this list).
  • Worked on AWS services like AWS SNS to send out automated emails and messages using Boto3 after the nightly run.
  • Imported and exported databases using SQL Server Integration Services (SSIS) and Data Transformation Services (DTS packages).
  • Coded Teradata BTEQ scripts to load and transform data and to fix defects such as SCD Type 2 date chaining and duplicate cleanup.
  • Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
  • Conducted data blending and data preparation using Alteryx and SQL for Tableau consumption and published data sources to Tableau Server.
  • Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near real-time log analysis and end-to-end transaction monitoring.
  • Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
  • Monitored containers on AWS EC2 machines using the Datadog API and ingested and enriched data into the internal cache system.
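A minimal Boto3 sketch of the AMI-cleanup Lambda described in the Lambda/Boto3 bullet above; the region list and the tag used to mark AMIs as unused are hypothetical placeholders.

    # Hypothetical Lambda handler that deregisters AMIs tagged Status=unused, per region.
    import boto3

    REGIONS = ["us-east-1", "us-west-2"]  # illustrative application regions

    def lambda_handler(event, context):
        deregistered = []
        for region in REGIONS:
            ec2 = boto3.client("ec2", region_name=region)
            # Only consider AMIs owned by this account and explicitly tagged as unused.
            images = ec2.describe_images(
                Owners=["self"],
                Filters=[{"Name": "tag:Status", "Values": ["unused"]}],
            )["Images"]
            for image in images:
                ec2.deregister_image(ImageId=image["ImageId"])
                deregistered.append(image["ImageId"])
        return {"deregistered": deregistered}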

Confidential, Watertown, PA

Azure Data Engineer

Responsibilities:

  • Worked on Azure Data Factory to integrate data from both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) sources and applied transformations to load the data back into Azure Synapse.
  • Managed, configured, and scheduled resources across the cluster using Azure Kubernetes Service.
  • Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
  • Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
  • Developed dashboards and visualizations to help business users analyze data and to provide data insights to upper management, with a focus on Microsoft products like SQL Server Reporting Services (SSRS) and Power BI.
  • Knowledge of U-SQL and how it can be used for data transformation as part of a cloud data integration strategy.
  • Performed the migration of large data sets to Databricks (Spark); created and administered clusters, loaded data, configured data pipelines, and loaded data from ADLS Gen2 into Databricks using ADF pipelines.
  • Created a Linked Service to land the data from an SFTP location into Azure Data Lake.
  • Created various pipelines to load the data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB.
  • Created Databricks notebooks to streamline and curate the data for various business use cases and mounted Blob Storage on Databricks (a minimal mount sketch follows this list).
  • Developed streaming pipelines using Apache Spark with Python.
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
  • Utilized Azure Logic Apps to build workflows to schedule and automate batch jobs by integrating apps, ADF pipelines, and other services like HTTP requests and email triggers.
  • Worked extensively on Azure Data Factory, including data transformations, Integration Runtimes, Azure Key Vault, triggers, and migrating Data Factory pipelines to higher environments using ARM templates.
  • Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
  • Wrote UDFs in Scala and PySpark to meet specific business requirements.
  • Developed JSON scripts for deploying the pipeline in Azure Data Factory (ADF) that processes the data using the SQL Activity.
  • Hands-on experience in developing SQL scripts for automation purposes.
  • Created builds and releases for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
  • Designed and developed a new solution to process near real-time (NRT) data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue.
  • Worked with complex SQL views, stored procedures, triggers, and packages in large databases across various servers.
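A minimal Databricks notebook sketch of the Blob Storage mount and curation step mentioned above; the storage account, container, secret scope, and mount point are hypothetical placeholders, and dbutils/spark are the handles Databricks exposes inside a notebook.

    # Hypothetical mount of an Azure Blob Storage container onto DBFS, then a simple curation step.
    storage_account = "examplestorageacct"
    container = "curated"
    mount_point = "/mnt/curated"

    # Pull the account key from a secret scope instead of hard-coding it.
    account_key = dbutils.secrets.get(scope="example-scope", key="storage-account-key")

    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source=f"wasbs://{container}@{storage_account}.blob.core.windows.net",
            mount_point=mount_point,
            extra_configs={
                f"fs.azure.account.key.{storage_account}.blob.core.windows.net": account_key
            },
        )

    # Curate the landed CSV files into a Delta output for downstream use cases.
    raw_df = spark.read.option("header", "true").csv(f"{mount_point}/landing/")
    raw_df.dropDuplicates().write.format("delta").mode("overwrite").save(f"{mount_point}/curated_output/")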

Confidential, Atlanta,GA

Big Data Developer

Responsibilities:

  • Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
  • Developed a data pipeline using Kafka and Storm to store data in HDFS.
  • Customized Flume interceptors to encrypt and mask customer-sensitive data as per requirements.
  • Built recommendations using item-based collaborative filtering in Apache Spark.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Used Kibana, an open-source, browser-based analytics and search dashboard for Elasticsearch.
  • Developed iterative algorithms using Spark Streaming in Scala for near real-time dashboards.
  • Involved in customizing the partitioner in MapReduce to route key-value pairs from Mapper to Reducers in XML format according to requirements.
  • Configured Flume for efficiently collecting, aggregating, and moving large amounts of log data.
  • Involved in creating Hive tables, loading data into them, and writing Hive queries to analyze the data.
  • Experienced in migrating HiveQL into Impala to minimize query response time.
  • Used Java 8 streams and lambda expressions to increase performance.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Designed and built the Reporting Application, which uses Spark SQL to fetch and generate reports on HBase table data.
  • Experience in building and architecting multiple data pipelines, including end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks among the team.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
  • Used different file formats like text files, Sequence Files, Avro, Record Columnar (RC) files, and ORC.
  • Designed and developed ETL packages using SQL Server Integration Services (SSIS) to load data from SQL Server and XML files into a SQL Server database through custom C# script tasks.
  • Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
  • Designed and documented the error-handling strategy in the ETL load process. Prepared the complete ETL specification document for all the ETL flows.
  • Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables and stored the analyzed results back into the Cassandra cluster.
  • Used REST APIs with Python to ingest data from external sources into BigQuery.
  • Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket (a minimal sketch follows this list).
  • Implemented CRUD operations involving lists, sets, and maps in DataStax Cassandra.
  • Worked on ETL tools like SSIS for data collection and updating longitudinal databases, and SPSS for data analysis and modeling.
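A minimal sketch of the GCS-triggered Cloud Function described above, loading a newly arrived CSV into BigQuery; the project, dataset, and table names are hypothetical placeholders.

    # Hypothetical background Cloud Function triggered when an object is finalized in a GCS bucket.
    from google.cloud import bigquery

    def load_csv_to_bigquery(event, context):
        bucket = event["bucket"]
        name = event["name"]
        if not name.endswith(".csv"):
            return  # ignore non-CSV uploads

        client = bigquery.Client()
        table_id = "example-project.example_dataset.landing_table"  # placeholder destination
        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            autodetect=True,
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )
        load_job = client.load_table_from_uri(
            f"gs://{bucket}/{name}", table_id, job_config=job_config
        )
        load_job.result()  # wait for the load job to finish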

Confidential, Marlborough, MA

Big Data Engineer

Responsibilities:

  • Interacted with business partners, business analysts, and the product owner to understand requirements and build scalable distributed data solutions using the Hadoop ecosystem.
  • Developed Spark Streaming programs to process near real-time data from Kafka, processing data with both stateless and stateful transformations (a minimal sketch follows this list).
  • Worked with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HQL queries.
  • Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer, which reduced execution time by 60%.
  • Worked on developing ETL processes (DataStage Open Studio) to load data from multiple data sources into HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
  • Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into the RDBMS through Sqoop.
  • Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC, with compression codecs like gzip, Snappy, and LZO.
  • Strong understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
  • Developed Pig UDFs for manipulating the data according to business requirements and worked on developing custom Pig loaders.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake SnowSQL; wrote SQL queries against Snowflake.
  • Experience in report writing using SQL Server Reporting Services (SSRS) and creating various types of reports such as drill-down, parameterized, cascading, conditional, table, matrix, chart, and sub-reports.
  • Used the DataStax Spark connector to store data into and retrieve data from the Cassandra database.
  • Worked on the implementation of a log producer in Scala that watches for application logs, transforms incremental logs, and sends them to a Kafka and Zookeeper based log collection platform.
  • Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for reporting on the dashboard.
  • Transformed the data using AWS Glue dynamic frames with PySpark; cataloged the transformed data using crawlers and scheduled the job and crawler using the workflow feature.
  • Created AWS Lambda functions, provisioned EC2 instances in the AWS environment, implemented security groups, and administered Amazon VPCs.
  • Used Jenkins pipelines to drive all microservice builds out to the Docker registry, then deployed them to Kubernetes, creating and managing Pods.
  • Developed data pipeline programs with the Spark Scala API, built data aggregations with Hive, and formatted data (JSON) for visualization and report generation.
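A minimal PySpark Structured Streaming sketch of the Kafka pattern referenced in the Spark Streaming bullet above, with one stateless filter and one stateful windowed count; the broker, topic, and checkpoint paths are hypothetical placeholders (the original work used the Scala API).

    # Hypothetical streaming job: read from Kafka, filter (stateless), windowed count (stateful).
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.appName("kafka-streaming-sketch").getOrCreate()

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "events-topic")
        .load()
        .selectExpr("CAST(value AS STRING) AS value", "timestamp")
    )

    # Stateless transformation: drop empty messages.
    filtered = events.filter(col("value") != "")

    # Stateful transformation: count messages per 5-minute event-time window.
    counts = filtered.groupBy(window(col("timestamp"), "5 minutes")).count()

    query = (
        counts.writeStream.outputMode("update")
        .format("console")
        .option("checkpointLocation", "/tmp/checkpoints/kafka-sketch")
        .start()
    )
    query.awaitTermination()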

Confidential

Data Analyst

Responsibilities:

  • Performed data analysis, data cleaning, data transformations, and data modeling in R and Python.
  • Experimented with various predictive models including Logistic Regression, Support Vector Machines (SVM), Random Forest, XGBoost, and Decision Trees to check model performance and accuracy.
  • Analyzed and solved business problems and found patterns and insights within structured and unstructured data.
  • Designed logical and physical data models for multiple OLTP and analytic applications.
  • Involved in analysis of business requirements and keeping track of data available from various data sources, transforming and loading the data into target tables using Informatica PowerCenter.
  • Worked on outlier identification with box plots and K-means clustering using pandas, NumPy, Matplotlib, and Seaborn.
  • Generated reports and visualizations based on the insights, mainly using Tableau, and developed dashboards.
  • Built a text classifier on the data glossary using TF-IDF to construct a feature space, implemented a Naive Bayes algorithm, and deployed it using a REST API and Flask (a minimal sketch follows this list).
  • Extensively used Star Schema methodologies in building and designing the logical data model into dimensional models.
  • Used Apache Spark to handle huge data sets and built machine learning models using Spark ML libraries.
  • Performed data pulls to retrieve data from AWS S3 buckets.
  • Built robust machine learning models using bagging and boosting methods.
  • Created stored procedures using PL/SQL and tuned the databases and backend processes.
  • Involved with data analysis, primarily identifying data sets, source data, source metadata, data definitions, and data formats.
  • Performed database performance tuning, which included indexing, optimizing SQL statements, and monitoring the server.
  • Developed Informatica mappings, sessions, and workflows and wrote PL/SQL code for effective and optimized data flow.
  • Wrote SQL queries, dynamic queries, sub-queries, and complex joins for generating complex stored procedures, triggers, user-defined functions, views, and cursors.
  • Wrote simple and advanced SQL queries and scripts to create standard and ad hoc reports for senior managers.
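A minimal scikit-learn and Flask sketch of the TF-IDF plus Naive Bayes classifier and REST deployment described above; the toy corpus, labels, and endpoint name are hypothetical placeholders.

    # Hypothetical text classifier: TF-IDF features, Multinomial Naive Bayes, Flask endpoint.
    from flask import Flask, jsonify, request
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    # Toy training data standing in for the data-glossary corpus.
    texts = ["customer account balance", "daily sales revenue", "employee payroll record"]
    labels = ["finance", "sales", "hr"]

    model = Pipeline([
        ("tfidf", TfidfVectorizer()),  # build the TF-IDF feature space
        ("nb", MultinomialNB()),       # Naive Bayes classifier on top of it
    ])
    model.fit(texts, labels)

    app = Flask(__name__)

    @app.route("/classify", methods=["POST"])
    def classify():
        text = request.get_json().get("text", "")
        return jsonify({"label": model.predict([text])[0]})

    if __name__ == "__main__":
        app.run(port=5000)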
