
Big Data Engineer Resume


Dallas, TX

SUMMARY

  • 8+ years of overall IT development experience, including Big Data, Python, Hadoop, Scala, Apache Spark, SQL and cloud technologies.
  • Experienced in Agile methodologies, including Extreme Programming, Scrum and Test-Driven Development (TDD).
  • Proficient in Hive optimization techniques such as bucketing and partitioning.
  • Experienced in loading datasets into Hive for ETL (Extract, Transform and Load) operations (see the PySpark sketch after this list).
  • Experience in importing and exporting data with Sqoop between relational database systems and HDFS.
  • Involved in writing data transformations and data cleansing using Pig operations, with good experience retrieving and processing data using Hive.
  • Experienced in developing web services in Python and in processing large datasets with Spark using Scala and PySpark.
  • Experience working with AWS services such as EC2, EMR, S3, KMS, Kinesis, Lambda, API Gateway and IAM.
  • Excellent knowledge of Hadoop architecture, including HDFS, Job Tracker, Task Tracker, Name Node, Data Node and the MapReduce programming paradigm.
  • Worked with HBase to conduct quick lookups (updates, inserts and deletes) in Hadoop.
  • Experience with the Oozie workflow scheduler to manage Hadoop jobs as Directed Acyclic Graphs (DAGs) of actions with control flows.
  • Extensive experience importing and exporting data using stream-processing platforms like Flume.
  • Developed Apache Spark jobs using Scala and Python for faster data processing, and used the Spark Core and Spark SQL libraries for querying.
  • Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark and Flume for Big Data and Big Data analytics.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
  • Experience in Microsoft Azure/Cloud Services like SQL Data Warehouse, Azure SQL Server, Azure Databricks, Azure Data Lake, Azure Blob Storage, Azure Data Factory
  • Hands-on experience in data-processing automation using Python.
  • Extensive experience using Maven as a build tool to produce deployable artifacts from source code.
  • Experience in creating Spark Streaming jobs to process huge data sets in real time.
  • Experience in text analytics, developing statistical machine learning and data mining solutions to various business problems, generating data visualizations using R, SAS and Python, and creating dashboards using tools like Tableau.
  • Proficient with tools such as Erwin (Data Modeler, Model Mart, Navigator), ER Studio, IBM Metadata Workbench, Oracle data profiling tools, Informatica, Oracle Forms, Reports, SQL*Plus, Toad and Crystal Reports.
  • Expertise in relational database systems (RDBMS) such as MySQL, Oracle and MS SQL, and NoSQL database systems like HBase, MongoDB and Cassandra.
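
A minimal PySpark sketch of the Hive ETL load pattern referenced above (cleanse a landed extract, then write it as a partitioned, bucketed Hive table); the Spark session is assumed to have Hive support, and all paths, database, table and column names are hypothetical placeholders:

```python
# Sketch only: cleanse a landed CSV extract and load it into a partitioned,
# bucketed Hive table so queries can prune partitions and co-locate join keys.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-etl-sketch")
         .enableHiveSupport()          # required for saveAsTable into Hive
         .getOrCreate())

# Raw daily extracts landed on HDFS (placeholder path).
raw = spark.read.csv("/data/landing/transactions", header=True, inferSchema=True)

# Basic cleansing before the load step.
clean = (raw.dropDuplicates(["txn_id"])
            .na.drop(subset=["txn_date", "customer_id"]))

# Partition by date and bucket by customer_id (16 buckets is an arbitrary choice).
(clean.write
      .partitionBy("txn_date")
      .bucketBy(16, "customer_id")
      .sortBy("customer_id")
      .mode("overwrite")
      .saveAsTable("analytics.transactions_clean"))
```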

TECHNICAL SKILLS

Big Data Technologies: Hadoop, HDFS, Hive, MapReduce, Pig, Sqoop, Flume, Kafka, HBase, PySpark, Spark, Teradata, Snowflake, Databricks

Programming Languages: Python, Scala, SQL

Databases/RDBMS: MySQL, SQL/PL-SQL, MS SQL Server, Oracle

NoSQL: MongoDB, HBase, Cassandra

Scripting/ Web Languages: HTML5, CSS3, XML, SQL, Shell

ETL & Data Tools: Cassandra, HBase, Elasticsearch, Informatica

Operating Systems: Linux, Windows, Mac OS

Software Life Cycles: SDLC, Waterfall and Agile models

Office Tools: Microsoft Office, risk-analysis tools, Visio

Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure

PROFESSIONAL EXPERIENCE

Confidential, Dallas, TX

Big Data Engineer

Responsibilities:

  • Installed, configured and deployed product software on new edge nodes that connect to the Kafka cluster for data acquisition.
  • Worked on Big Data on AWS cloud services, i.e. EC2, S3, EMR and DynamoDB.
  • Worked on dimensional and relational data modeling using star and snowflake schemas, OLTP/OLAP systems, and conceptual, logical and physical data modeling using Erwin.
  • Developed automated regression scripts in Python to validate ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL), and connected Hive tables to Tableau via HiveServer2 to generate interactive reports.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Authored Python scripts for custom UDFs covering row/column manipulations, merges, aggregations, stacking, data labeling, and all cleansing and conforming tasks.
  • Used Flume, Kafka and Spark Streaming to ingest real-time or near-real-time data into HDFS.
  • Worked on Big Data integration and analytics based on Hadoop, Spark, Kafka and webMethods.
  • Designed and implemented multiple ETL solutions over various data sources using extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools; performed data profiling and wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Installed, configured and maintained data pipelines.
  • Worked on analyzing the Hadoop cluster using different big data analytic tools, including Flume, Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Spark and Kafka.
  • Played a key role in migrating Cassandra and Hadoop clusters to AWS and defined different read/write strategies.
  • Wrote multiple MapReduce programs in Java for data extraction, transformation and aggregation from multiple file formats, including XML, JSON, CSV and other compressed formats.
  • Developed solutions that leverage ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Experienced in fact/dimensional modeling (star schema, snowflake schema), transactional modeling and SCDs (slowly changing dimensions).
  • Created Python scripts to read CSV, JSON and Parquet files from S3 buckets and load them into AWS S3 and Snowflake; files extracted from Hadoop were dropped into S3 on a daily and hourly basis.
  • Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Managed security groups on AWS, focusing on high availability, fault tolerance and auto-scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
  • Worked with analysis tools such as Tableau for regression analysis, pie charts and bar graphs.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various AML modules, including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD and EDD.
  • Used Spark Streaming to receive real-time data from Kafka and stored the streamed data in HDFS and NoSQL databases such as HBase and Cassandra using Python (see the sketch after this list).
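
As a minimal illustration of the Kafka-to-HDFS ingestion described above, here is a hedged PySpark Structured Streaming sketch; the broker addresses, topic name, schema and HDFS paths are hypothetical, and the job assumes the spark-sql-kafka connector is available on the cluster:

```python
# Sketch only: stream JSON events from Kafka, transform on the fly, persist to HDFS as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-hdfs-sketch").getOrCreate()

# Placeholder schema for the learner event payload.
event_schema = (StructType()
                .add("learner_id", StringType())
                .add("event_type", StringType())
                .add("event_ts", TimestampType()))

# Read the raw Kafka stream (the value column arrives as bytes).
kafka_stream = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
                .option("subscribe", "learner-events")
                .option("startingOffsets", "latest")
                .load())

# Parse the JSON payload into typed columns.
events = (kafka_stream
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Persist to HDFS as Parquet; the checkpoint gives restartable, append-only output.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/learner_events")
         .option("checkpointLocation", "hdfs:///checkpoints/learner_events")
         .outputMode("append")
         .start())

query.awaitTermination()
```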

Environment: AWS, Hadoop, Hive, S3, Kafka, PySpark, HDFS, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, SQL Server, Python, Shell Scripting, HBase, Cassandra, XML

Confidential, Vernon Hills, IL

Big Data Engineer

Responsibilities:

  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and, as part of the cloud migration, processed the data in Azure Databricks.
  • Worked with hundreds of terabytes of data collected from different loan applications into HDFS.
  • Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
  • Developed an application to refresh Tableau reports using an automated trigger API.
  • Developed Spark programs for faster data processing than standard MapReduce programs.
  • Utilized Apache Spark with Python to develop and execute Big Data Analytics.
  • Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
  • Responsible for designing and developing data ingestion from Kroger using Apache NiFi/Kafka.
  • Worked on UDFs in Python for data cleansing.
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Developed storytelling dashboards in Tableau Desktop and published them to Tableau Server, allowing end users to explore the data on the fly with quick filters for on-demand information.
  • Used Python to extract, transform and load source data from transaction systems and generated reports, insights and key conclusions.
  • Exposed transformed data in the Azure Databricks platform as Parquet for efficient data storage (see the sketch after this list).
  • Designed and implemented large-scale pub-sub message queues using Apache Kafka.
  • Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory and Azure Data Lake Analytics, ingesting data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing it in Azure Databricks.
  • Skilled in data visualization libraries such as Matplotlib and Seaborn.
  • Developed a near-real-time data pipeline using Flume, Kafka and Spark Streaming to ingest client data from web log servers and apply transformations.
  • Performed code reviews and troubleshooting of existing Informatica mappings and deployed code from development to test to production environments.
  • Built the infrastructure required for optimal extraction, transformation and loading of data from a wide variety of data sources using SQL and big data technologies like Hadoop Hive and Azure Data Lake Storage.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Experience implementing machine learning back-end pipelines with Pandas and NumPy.
  • Worked on creating POCs for multiple business user stories using the Hadoop ecosystem.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
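
As a rough sketch of the Azure Databricks work described above (reading from Azure Data Lake Storage and exposing curated data as Parquet), the snippet below uses placeholder storage account, container, secret scope and column names; dbutils is available only inside Databricks:

```python
# Sketch only: read raw loan CSV files from ADLS Gen2, conform them, and write curated Parquet.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("adls-to-parquet-sketch").getOrCreate()

# Authenticate to ADLS Gen2 with an account key held in a Databricks secret scope (placeholders).
spark.conf.set(
    "fs.azure.account.key.examplelake.dfs.core.windows.net",
    dbutils.secrets.get(scope="lake-secrets", key="examplelake-key"))  # Databricks-only helper

raw = (spark.read
       .option("header", True)
       .csv("abfss://raw@examplelake.dfs.core.windows.net/loans/"))

# Light conforming step before the curated layer.
curated = (raw.withColumn("origination_date", to_date(col("origination_date")))
              .filter(col("loan_amount").isNotNull()))

# Write to the curated zone as Parquet, partitioned for downstream queries.
(curated.write
        .mode("overwrite")
        .partitionBy("origination_date")
        .parquet("abfss://curated@examplelake.dfs.core.windows.net/loans/"))
```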

Environment: Python, Tableau, SQL, Spark, Kafka, Power BI, Microsoft Azure, Hadoop, Hortonworks, HDFS, HBase, Oozie, Scala, Apache Airflow, Jira, PySpark, Informatica

Confidential, Bay Area, CA

Data Engineer/ Hadoop Engineer

Responsibilities:

  • Imported data from different sources such as HDFS/HBase into Spark RDDs and developed a data pipeline using Kafka and Storm to store data in HDFS.
  • Involved in different phases of the development life cycle, including analysis, design, coding, unit testing, integration testing, review and release, per the business requirements.
  • Developed Spark code using Scala for faster testing and processing of data.
  • Installed and configured Apache Hadoop across multiple nodes on AWS EC2.
  • Involved with Kafka, building use cases relevant to our environment.
  • Developed Oozie workflow jobs to execute Hive, Sqoop and MapReduce actions.
  • Architected, designed and developed business applications and data marts for reporting.
  • Involved in building an information pipeline and performed analysis utilizing AWS stack (EMR, EC2, S3, RDS, Lambda, Glue, SQS, and Redshift).
  • Created Hive external tables to stage data and then moved the data from staging to the main tables.
  • Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into the HDFS system.
  • Documented the requirements, including the available code to be implemented using Spark, Hive, HDFS, HBase and Elasticsearch.
  • Participated in requirements sessions to gather requirements along with business analysts and product owners.
  • Created integrated relational 3NF models that can functionally relate to other subject areas, and determined the corresponding transformation rules in the Functional Specification Document.
  • Responsible for developing data pipelines using Flume, Sqoop and Pig to extract the data from weblogs and store it in HDFS.
  • Pulled data from the data lake (HDFS) and massaged it with various RDD transformations (see the sketch after this list).
  • Collaborated with business users to gather requirements for building Tableau reports per business needs.
  • Developed a continuous flow of data into HDFS from social feeds using Apache Storm spouts and bolts.
  • Developed Big Data solutions focused on pattern matching and predictive modeling.
  • The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
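
A minimal PySpark RDD sketch of the "pull from the HDFS data lake and massage with RDD transformations" step above; the weblog path and field layout are assumptions:

```python
# Sketch only: parse raw weblog lines from HDFS and count requests per URL with RDD transformations.
from pyspark import SparkContext

sc = SparkContext(appName="weblog-rdd-sketch")

# Pull raw weblog lines from the HDFS data lake (placeholder path).
logs = sc.textFile("hdfs:///data/weblogs/")

def parse_line(line):
    # Assumes space-delimited logs: ip, timestamp, method, url, status, ...
    parts = line.split(" ")
    return (parts[3], 1) if len(parts) > 3 else None

# Massage the data: parse, drop malformed rows, aggregate, rank.
url_counts = (logs.map(parse_line)
                  .filter(lambda kv: kv is not None)
                  .reduceByKey(lambda a, b: a + b)
                  .sortBy(lambda kv: kv[1], ascending=False))

# Persist the aggregated result back to HDFS for downstream reporting.
url_counts.saveAsTextFile("hdfs:///data/reports/url_counts")
```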

Environment: Hadoop, YARN, HDFS, Spark, 3NF, Flume, Sqoop, Pig, MapReduce, UNIX, Zookeeper, HBase, Kafka, Scala, NoSQL, Cassandra, Elasticsearch.

Confidential

Hadoop Developer

Responsibilities:

  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Experienced in managing and reviewing the Hadoop log files using shell scripts.
  • Developed Flume agents for loading and filtering the streaming data into HDFS.
  • Handled continuous streaming data from different sources using Flume, with HDFS as the destination.
  • Developed job workflows in Oozie to automate the loading of data into HDFS and a few other Hive jobs.
  • Used AWS S3 to store large amounts of data in a common repository.
  • Responsible for coding Java batch jobs, RESTful services, MapReduce programs and Hive queries, along with testing, debugging, peer code review, troubleshooting and maintaining status reports.
  • Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Developed Pig Latin scripts to extract and filter relevant data from the web server output files and load it into HDFS.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Experience in writing custom MapReduce programs and UDFs in Java to extend Hive and Pig core functionality (a streaming sketch of the same pattern follows this list).
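
The MapReduce and UDF work above was done in Java; purely as an illustration of the same map/shuffle/reduce pattern, here is a small Hadoop Streaming sketch in Python that counts web-server hits per HTTP status code (the field index, the file name status_count.py and all paths are hypothetical):

```python
#!/usr/bin/env python3
# Sketch only: Hadoop Streaming mapper and reducer in one file, selected via argv.
import sys

def mapper():
    # Emit (status_code, 1) for each log line; field position 8 is a placeholder.
    for line in sys.stdin:
        fields = line.rstrip("\n").split(" ")
        if len(fields) > 8:
            print(f"{fields[8]}\t1")

def reducer():
    # Hadoop delivers reducer input sorted by key, so counts can be summed in one pass.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

The script would be submitted with the standard hadoop-streaming jar, passing "status_count.py map" as the mapper command and "status_count.py reduce" as the reducer command.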

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, AWS, Flume, Oozie, HBase, Sqoop, RDBMS/DB, flat files, MySQL

Confidential

Data Stage Developer

Responsibilities:

  • Used the DataStage Designer to develop processes for extracting, cleansing, transforming, integrating and loading data into the data warehouse.
  • Extensively used processing stages such as the Lookup stage to perform lookups against the various target tables, the Modify stage to alter the record schema of the input data set, the Funnel stage to combine various datasets into a single large dataset, and the Switch stage to trigger the required output based on a specific condition.
  • Converted the design into well-structured, high-quality DataStage jobs.
  • Developed PL/SQL procedures, functions, packages, triggers, and normal and materialized views.
  • Worked with the reporting team on extensive reporting against the data mart, including slice-and-dice, drill-down and drill-through.
  • Performed ETL performance tuning to increase ETL process speed.
  • Designed DataStage parallel jobs using the Designer to extract data from various source systems, transform and convert the data, load it into the data warehouse, and send data from the warehouse to third-party systems such as mainframes.

Environment: DataStage, ETL, SQL, Python, PL/SQL, SQL Server, Git.
